Abstract

The effectiveness of four methods of handling missing data in reproducing the target sample 
covariance matrix and mean vector was tested using three levels of incomplete cases: 30%, 50%, 
and 70%. Data were selected from the NELS (National Educational Longitudinal Study) database.
Three levels of sample size (500, 1000, 2000) were used. The assumption of missing completely
at random was violated in all samples. Results indicate listwise deletion was most effective in
replicating the target mean vector and covariance matrix. 
Effectiveness of Four Methods of Handling Missing Data
Using Samples from a National Database 

When data are analyzed in survey research, there are often missing values. If the
mechanism causing the missing values is known, the solution to this problem may be incorporated 
in the study. Inevitably, however, when data are collected by survey, subjects may fail to answer 
some questions for reasons unknown to the researcher. Ignoring this problem may lead to 
analysis of data that are of dubious value.

In addition, different methods of handling missing values may produce different results. 
When Jackson (1968) entered data on all the available variables in a discriminant analysis, the 
significance of the regression coefficients of individual variables, as well as the interpretation of 
the importance of these variables, changed with the missing value method used. Witta and Kaiser 
(1991) also reported that the regression coefficients and total variance accounted for by the 
variables changed depending on the method used to handle missing values. After re-analyzing 
three studies of private/public school achievement, Ward and Clark III (1991) concluded that the 
method used to handle missing data influenced the outcome of these studies. 

In using the National Educational Longitudinal Study of 1988 database to investigate the 
effects of part-time work on school outcomes, Singh and Ozturk (1999) eliminated more than half
of the selected cases by listwise deletion of the incomplete data. This leads to the question: was
listwise deletion an appropriate method for handling the missing data, or would another method
have been more appropriate?

Statement of the Problem

The purpose of the current study was to investigate the effectiveness of four methods of 
handling missing data using the 26 variables in the Singh and Ozturk study. Effectiveness was 
defined as the probability of accurately reproducing the true covariance matrix and mean vector. 
Effectiveness of the missing data methods was assessed by manipulating the proportion of cases 
containing missing values and the sample size. The missing data methods studied were listwise 
deletion, pairwise deletion, regression, and expectation maximization. Sample sizes investigated
were 500, 1000, and 2000. The proportions of incomplete cases in each sample were 30%, 50%,
and 70%.

Until recently, the only methods available with popular statistical computer software 
focused on handling the missing data problem by deleting subjects with incomplete information, 
deleting the missing values, or replacing the missing value with some reasonable estimate. Now, 
however, new subroutines are available to provide more assistance in handling missing data and 
providing analysis choices using iterative regression or expectation maximization (EM) 
procedures. These relatively new methods (in current software) also provide the possibility of
specifying the model to be used (e.g., multivariate normality, adding a randomly selected error).

Methods Studied 

Listwise Deletion 

Listwise deletion is probably the most frequently used method of handling missing data
and is available as a default option in several statistical software programs. This method
discards cases with a missing value on any variable and thus is very wasteful of data. Listwise
deletion, however, has been shown to be effective with low average intercorrelation, fewer than
four variables, and a small proportion of missing values (Chan et al., 1976; Haitovsky, 1968;
Timm, 1970). The assumption of missing completely at random is crucial to the use of this 
method. It is more likely, however, to find the complete sample different in important ways from 
the incomplete sample (Little & Rubin, 1987). Problems for a researcher using this method 
include a reduction in power and an increase in standard error due to reduced sample size and the 
possible elimination of sub-populations. 
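
As a minimal illustration of listwise deletion, the sketch below uses Python with NumPy rather than the SPSS routine used in this study. The function name and the toy data are hypothetical, and missing values are assumed to be coded as np.nan.

```python
import numpy as np

def listwise_deletion(data):
    """Keep only the cases (rows) with no missing value on any variable."""
    complete = ~np.isnan(data).any(axis=1)
    return data[complete]

# Toy data (assumption): 5 cases, 3 variables, np.nan marks a missing value.
toy = np.array([
    [1.0, 2.0, 3.0],
    [2.0, np.nan, 1.0],
    [3.0, 1.0, np.nan],
    [4.0, 5.0, 6.0],
    [5.0, 4.0, 2.0],
])

complete_cases = listwise_deletion(toy)
mean_vector = complete_cases.mean(axis=0)           # means from complete cases only
cov_matrix = np.cov(complete_cases, rowvar=False)   # covariance from complete cases only
```

In this toy example two of the five cases are discarded because each lacks a single value, which illustrates how quickly the method loses data as the proportion of incomplete cases grows.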

Pairwise Deletion 

When using pairwise deletion, covariances are computed between all pairs of variables 
having both observations, eliminating those that have a missing value for one of the two variables
(Glasser, 1964). Means and variances are computed on all available observations. The
assumption made is that the use of the maximum number of pairs and all the individual
observations yields more valid estimates of the relationship between the variables. It is assumed
that when two variables are correlated, information on one improves the estimates of the other 
variable. It is also assumed that the pairs are a random subset of the sample pairs. If these 
assumptions are true, pairwise deletion produces unbiased estimates of the variable means and 
variances (Hertel, 1976). When missing data are not missing completely at random, however, the 
correlation matrix produced by pairwise deletion may not be Gramian (Norusis, 1988b). 

Marsh (1998) investigated the estimates produced when using pairwise deletion for 
randomly missing data. From this study, which included five levels of missing data and three 
sample sizes, Marsh concluded parameter variability was explained, parameter estimates were 
unbiased, and only one covariance matrix was nonpositive definite. 
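
The pairwise computation can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the SPSS procedure: each covariance uses the cases observed on both variables, with the means for that entry taken from the same pair-complete cases (software packages differ on this detail). The helper that checks whether the resulting matrix is Gramian (positive semidefinite) is likewise hypothetical.

```python
import numpy as np

def pairwise_cov(data):
    """Covariance matrix in which each entry uses the cases observed on both variables."""
    n_vars = data.shape[1]
    observed = ~np.isnan(data)
    cov = np.full((n_vars, n_vars), np.nan)
    for i in range(n_vars):
        for j in range(i, n_vars):
            both = observed[:, i] & observed[:, j]
            n_pairs = int(both.sum())
            if n_pairs > 1:
                xi, xj = data[both, i], data[both, j]
                value = np.sum((xi - xi.mean()) * (xj - xj.mean())) / (n_pairs - 1)
                cov[i, j] = cov[j, i] = value
    return cov

def is_positive_semidefinite(matrix, tol=1e-10):
    """True when no eigenvalue of the symmetric matrix is meaningfully negative."""
    return bool(np.all(np.linalg.eigvalsh(matrix) >= -tol))

# Toy data (assumption): np.nan marks a missing value.
toy = np.array([
    [1.0, 2.0, 3.0],
    [2.0, np.nan, 1.0],
    [3.0, 1.0, np.nan],
    [4.0, 5.0, 6.0],
    [5.0, 4.0, 2.0],
])
pairwise = pairwise_cov(toy)
means = np.nanmean(toy, axis=0)  # means use all available observations
```

Because each entry may rest on a different subset of cases, nothing forces the assembled matrix to be positive semidefinite, which is why a check such as is_positive_semidefinite is worth running before further analysis.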

Regression

Regression as an imputation method has many variations. The regression methods rely on
information contained in non-missing values of other variables to provide estimates of missing 
values. As the average intercorrelation and the number of variables from which these methods
can obtain information increase, the regression methods, theoretically, perform better. Too many
variables, however, can cause problems with overprediction (Kaiser & Tracy, 1988) and too high
an average intercorrelation can result in a singular matrix. In these cases, regression does not 
perform well. 

Variations in the regression methods include differences in methods of developing the 
initial correlation matrix (listwise deletion, pairwise deletion, and mean substitution) and the 
presence or absence of iteration procedures. Differences in regression methods also include the 
use of randomly selected residuals for iterations and assumptions of a normal distribution. 
Theoretically, the more variables considered that provide additional information, the better the 
estimate. Mundfrom and Whitcomb (1998) investigated the effects of using mean substitution,
hot-deck imputation, and regression imputation on classification of cardiac patients. Mean 
substitution and hot-deck imputation correctly classified patients more frequently than regression 
imputation. 
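
One of the many regression variations described above can be sketched as follows. This is a single-pass version in NumPy, offered only as an illustration: it assumes the regression coefficients are estimated from the complete cases (a listwise starting point), and the optional add_residual flag, a hypothetical parameter, adds a randomly selected residual to each prediction. It is not the SPSS routine used in this study.

```python
import numpy as np

def regression_impute(data, add_residual=False, seed=0):
    """Fill missing values with predictions from the variables observed for that case.

    Coefficients are estimated from the complete cases only (an assumption).
    """
    rng = np.random.default_rng(seed)
    filled = data.copy()
    missing = np.isnan(data)
    complete = data[~missing.any(axis=1)]
    for row in np.where(missing.any(axis=1))[0]:
        miss, obs = missing[row], ~missing[row]
        if not obs.any():
            continue  # no observed variables to predict from
        # Regress the missing variables on the observed ones (with intercept).
        X = np.column_stack([np.ones(len(complete)), complete[:, obs]])
        Y = complete[:, miss]
        beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
        pred = np.concatenate([[1.0], data[row, obs]]) @ beta
        if add_residual:
            residuals = Y - X @ beta
            pred = pred + residuals[rng.integers(len(residuals))]
        filled[row, miss] = pred
    return filled

# Toy data (assumption): the second variable equals 2 * first + 1 where observed.
toy = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 5.0], [3.0, np.nan]])
imputed = regression_impute(toy)
```

Without the added residual, imputed values fall exactly on the fitted regression surface, which tends to understate variances; adding a residual is one common way of restoring some of that variability.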

Expectation Maximization 

Dempster, Laird, and Rubin (1977) recommended the use of the EM algorithm, which
imputes estimates simultaneously in an iterative procedure. The E step of this algorithm finds the 
conditional expectation of the missing values. The M step performs maximum likelihood 
estimation as if there were no missing data. The primary difference between this procedure and 
the regression procedure is that the values for the missing data are not imputed and then iterated. 
The missing values are functions based on the conditional expectation (Little & Rubin, 1987).

This method of handling missing data represents a fundamental shift in the way of thinking 
about missing data (Schafer & Olsen, 1998). 
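
The E and M steps described above can be sketched for the multivariate normal case as follows. This is a textbook-style illustration in NumPy under the assumption of multivariate normality, not the SPSS implementation used in this study; the function name, starting values, and simulated data are all hypothetical. The E step accumulates the expected sums and cross-products given the observed values, and the M step recomputes the mean vector and covariance matrix as if the data were complete.

```python
import numpy as np

def em_mvn(data, n_iter=100, tol=1e-8):
    """EM estimates of the mean vector and (maximum likelihood) covariance matrix."""
    n, p = data.shape
    missing = np.isnan(data)
    mu = np.nanmean(data, axis=0)              # starting values (assumption)
    sigma = np.diag(np.nanvar(data, axis=0))
    for _ in range(n_iter):
        sum_x = np.zeros(p)
        sum_xx = np.zeros((p, p))
        for row in range(n):
            m, o = missing[row], ~missing[row]
            x = data[row].copy()
            cond_cov = np.zeros((p, p))
            if m.any() and o.any():
                # E step: conditional mean and covariance of the missing part.
                coef = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
                x[m] = mu[m] + coef @ (x[o] - mu[o])
                cond_cov[np.ix_(m, m)] = sigma[np.ix_(m, m)] - coef @ sigma[np.ix_(o, m)]
            elif m.any():
                x[m] = mu[m]
                cond_cov = sigma.copy()
            sum_x += x
            sum_xx += np.outer(x, x) + cond_cov
        # M step: update the parameters as if there were no missing data.
        new_mu = sum_x / n
        new_sigma = sum_xx / n - np.outer(new_mu, new_mu)
        converged = (np.max(np.abs(new_mu - mu)) < tol
                     and np.max(np.abs(new_sigma - sigma)) < tol)
        mu, sigma = new_mu, new_sigma
        if converged:
            break
    return mu, sigma

# Simulated data (assumption): 300 cases, 3 variables, about 30% missing on the third.
rng = np.random.default_rng(0)
true_cov = np.array([[1.0, 0.6, 0.3], [0.6, 1.0, 0.5], [0.3, 0.5, 1.0]])
full = rng.multivariate_normal(np.zeros(3), true_cov, size=300)
holed = full.copy()
holed[rng.random(300) < 0.3, 2] = np.nan
em_mean, em_cov = em_mvn(holed)
```

Note that no filled-in data set is carried from one iteration to the next; only the expected sufficient statistics are updated, which is the distinction from iterated regression imputation drawn above.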

Pattern of Missing Values 

All of the missing data handling procedures discussed require data missing at random 
(MAR) or missing completely at random (MCAR). Yet Cohen and Cohen (1983) suggested that 
in survey research the absence of data on one variable may be related to another variable and may 
be due to the value of the variable itself. When investigating simultaneously missing values, Witta 
(1996/97) found concurrently missing values (p<.001) in three of four samples using data from a
national database. 

Schafer and Olsen (1998), however, argue convincingly that “every missing-data method 
must make some largely untestable statistical assumptions about the manner in which the missing 
values were lost” (p. 551). Consequently, when analyzing real data, researchers typically assume
missing at random. 

Procedure 

All high school seniors who had reported working during their senior year of high school 
and for whom base-year and first follow-up data were available were included in this study. The 
initial sample contained the 26 variables used in the Singh and Ozturk study for 4664 subjects. 
These subjects were split into three populations: those containing one or more missing values but 
less than 14 (n=1542), those containing more than 13 missing values (n=19), and those containing 
no missing values on any variable (n=3103). The 19 subjects having missing values for more than 
half the variables were eliminated from further analysis. The remaining two populations (n=4645) 
were used to create samples for analysis.

Creating Test Samples 

A sample consisting of 2000 cases was randomly selected from the non-missing 
population. This sample was duplicated twice resulting in three identical samples of 2000 cases 
containing no missing values. These samples were used to provide estimates of the target (true) 
covariance matrices and mean vectors. 

A sample of 1400 cases was randomly selected from the missing population. These cases
were used to replace an equal number of randomly selected cases from one of the target samples. 
This provided a test sample of 2000 with 70% of the cases containing missing values. It was 
assumed that the replacement incomplete cases were similar to the complete cases that were 
removed. This process was repeated with the second target sample to provide a test sample with 
50% (1000) of the cases containing missing values. The process was repeated again with the third 
target sample to provide a test sample with 30% (600) of the cases containing missing values. 

This entire procedure was repeated twice to provide test samples with 30%, 50%, and
70% of the cases containing missing values in test samples of 1000 and 500 cases. Thus, 9 test 
samples were created. 
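
The replacement step used to build each test sample can be sketched as follows. The function name, arguments, and toy arrays are hypothetical; the sketch only mirrors the description above, in which randomly selected complete cases in a target sample are replaced by an equal number of randomly selected incomplete cases.

```python
import numpy as np

def make_test_sample(target, incomplete_pool, proportion, seed=0):
    """Replace a given proportion of the target cases with randomly drawn incomplete cases."""
    rng = np.random.default_rng(seed)
    n_replace = int(round(proportion * len(target)))
    replace_idx = rng.choice(len(target), size=n_replace, replace=False)
    donor_idx = rng.choice(len(incomplete_pool), size=n_replace, replace=False)
    sample = target.copy()          # the target sample itself is left untouched
    sample[replace_idx] = incomplete_pool[donor_idx]
    return sample

# Toy stand-ins (assumption): 20 complete cases and a pool of 15 incomplete cases.
target = np.arange(60, dtype=float).reshape(20, 3)
pool = np.ones((15, 3))
pool[:, 0] = np.nan
test_sample = make_test_sample(target, pool, proportion=0.5)
n_incomplete = int(np.isnan(test_sample).any(axis=1).sum())
```

With the study's figures, a target sample of 2000 and a proportion of 0.70 would correspond to replacing 1400 cases.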

Analysis 

Covariance matrices and mean vectors for the missing data handling methods were 
produced by the missing data subroutine in SPSS. The test for missing completely at random and 
pattern of missing data were also produced by this subroutine. The variable means produced by
each method were compared with the corresponding mean values of the target sample using the 
MANOVA (multivariate analysis of variance) subroutine in SPSS for every method except 
pairwise deletion.

Because the MANOVA subroutine does not accept pairwise deletion, the vector of 
variable means produced by pairwise deletion was compared to that of the target sample using 
Quattro Pro. The mean vector tested for pairwise deletion was the mean given for all values of 
each variable. Multi-sample analysis in LISREL (Jöreskog & Sörbom, 1989, chap. 9) was used
to test the equality of the covariance matrices produced by various missing data handling methods 
to the covariance matrix of the target sample. 

Results 

Randomness of Missing Values 

When variables from the total sample were tested for differences based upon the
missingness of another variable, results suggested the missing data may not be missing at random
and are not missing completely at random. For example, cases not missing a standardized test (n >
3344) had average reported grades ranging from 6.4 to 7.2 (high=low grade). The average
reported grades for cases missing a standardized test (n>698) ranged from 7.0 to 7.5. The average 
grade reported for a given missing standardized test was always at least 0.2 points higher (lower 
grade) than the non-missing equivalent. 
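
A comparison of this kind can be sketched as follows. The code is an informal illustration, not the test produced by the SPSS subroutine: it splits one variable by whether another variable is missing and reports the mean difference with a Welch-type t statistic. The function name, column positions, and toy values are hypothetical.

```python
import numpy as np

def mean_by_missingness(data, value_col, indicator_col):
    """Compare the mean of one variable between cases missing and not missing another."""
    is_missing = np.isnan(data[:, indicator_col])
    values = data[:, value_col]
    with_missing = values[is_missing & ~np.isnan(values)]
    without_missing = values[~is_missing & ~np.isnan(values)]
    diff = with_missing.mean() - without_missing.mean()
    std_err = np.sqrt(with_missing.var(ddof=1) / len(with_missing)
                      + without_missing.var(ddof=1) / len(without_missing))
    return diff, diff / std_err

# Toy data (assumption): column 0 = reported grade, column 1 = standardized test score.
toy = np.array([
    [6.0, 50.0],
    [7.0, 55.0],
    [6.0, 60.0],
    [7.0, 52.0],
    [8.0, np.nan],
    [7.0, np.nan],
    [8.0, np.nan],
])
grade_diff, t_stat = mean_by_missingness(toy, value_col=0, indicator_col=1)
```

A difference that is clearly nonzero, as in this toy example, is evidence against missing completely at random, because missingness on one variable is then related to the values of another.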

In addition, none of the nine samples used in the current study contained data missing 
completely at random. The frequency of simultaneously missing variables for each sample is 
depicted in Figure 1. The category of ‘Std Test’ consists of four simultaneously missing 
standardized test variables (History, Math, Reading, and Science). The standardized test variables 
were also missing in conjunction with missing values for grades which is depicted in Figure 1 as 
‘Grd & Test’. The four grade variables were also missing simultaneously. If a variable did not 
contain a missing value for 10% of the sample cases, it was included in the ‘Other’ category. In
each sample, the majority of the cases containing missing values consisted of concurrently missing 
values for standardized tests (the categories ‘Std Test’ and ‘Grd & Test’). 
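
Tabulating simultaneously missing variables, as summarized in Figure 1, can be sketched as follows. The function and the toy data are hypothetical, and the grouping into categories such as ‘Std Test’ or ‘Other’ is left to the analyst; the sketch only counts how often each combination of missing variables occurs.

```python
from collections import Counter

import numpy as np

def missing_patterns(data, names):
    """Count the cases showing each combination of simultaneously missing variables."""
    counts = Counter()
    for row in np.isnan(data):
        if row.any():
            pattern = tuple(name for name, is_missing in zip(names, row) if is_missing)
            counts[pattern] += 1
    return counts

# Toy data (assumption): two test scores and one grade, np.nan marks a missing value.
names = ["Math", "Reading", "Grade"]
toy = np.array([
    [50.0, 52.0, 7.0],
    [np.nan, np.nan, 8.0],
    [np.nan, np.nan, np.nan],
    [55.0, 51.0, np.nan],
    [np.nan, np.nan, 7.0],
])
patterns = missing_patterns(toy, names)
```

Patterns dominated by a few combinations, such as all test scores missing together, indicate that values are not lost independently across variables.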



Insert Figure 1 About Here 



Covariance Matrix Reproduction 

Surprisingly, all four missing data methods adequately reproduced (χ², p>.05) the target
sample covariance matrix when 30% or 50% of the cases contained missing values regardless of
sample size.¹ In addition, as depicted in Table 1, the goodness of fit index in all cases was above
0.98 and the root mean square residual was less than 1 except for two cases.

When 70% of the cases contained missing values, however, only the covariance matrix 
produced by the EM algorithm passably reproduced the target sample matrix when the sample 
size was 500. When the sample size was 1000 or 2000 with 70% of the cases containing missing 
values, no method adequately reproduced the target sample covariance matrix as measured by
chi-square (χ², p<.05). The goodness of fit index for these conditions remained at an acceptable level
of 0.96 or higher. The root mean square residual also remained relatively small as shown in Table 1.
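
For readers unfamiliar with the root mean square residual, the sketch below computes it from two covariance matrices using the commonly cited definition: the square root of the mean squared difference over the distinct (lower-triangular, including diagonal) elements. It is assumed, not confirmed by the source, that this matches the value LISREL reported; the function name and toy matrices are hypothetical.

```python
import numpy as np

def root_mean_square_residual(sample_cov, target_cov):
    """Root mean square of the differences over the distinct covariance elements."""
    diff = np.asarray(sample_cov, dtype=float) - np.asarray(target_cov, dtype=float)
    rows, cols = np.tril_indices(diff.shape[0])   # lower triangle including diagonal
    return float(np.sqrt(np.mean(diff[rows, cols] ** 2)))

# Toy matrices (assumption): every distinct element differs by exactly 1.
method_cov = np.array([[2.0, 1.0], [1.0, 2.0]])
target_cov = np.array([[1.0, 0.0], [0.0, 1.0]])
rmsr = root_mean_square_residual(method_cov, target_cov)
```

Because the index is expressed in the units of the covariances, its size should be judged relative to the scale of the variables involved.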



¹ To prevent discrepancies in sample size comparison, the n for testing the covariance
matrices produced by listwise and pairwise deletion was entered in LISREL as the target n (i.e., if
the target sample contained 500 cases, the n entered for the listwise deletion covariance matrix
was 500).






Insert Table 1 About Here 



Mean Vector Tests 

When 30% of the cases contained missing values, all missing data methods adequately 
reproduced the target sample mean vector as measured by F (p<.05) regardless of sample size,² as
depicted in Table 2. In addition, less than 2% of the difference in mean vectors could be explained 
by missing data method group as measured by eta square. 

When 50% of the cases contained missing values and the sample size was 500, all missing 
data methods adequately reproduced the target sample mean vector again. However, the variance 
accounted for by missing data method had increased to approximately 3% when the target sample 
mean vector was contrasted to the vector produced by the EM algorithm or the vector produced 
by regression. When the sample size increased to 1000, all methods except the EM algorithm 
adequately reproduced the target sample mean vector (p<.05). The variance accounted for by 
missing data method was again 2% or less. When the sample size increased to 2000, only listwise 
deletion adequately reproduced the target sample mean vector. The variance in mean vectors 
accounted for by group was again 2% or less. 

When the proportion of cases containing missing values increased to 70%, only listwise 
deletion adequately reproduced the target sample mean vector in all conditions. When the sample
size was 500 or 1000, neither the EM algorithm nor the regression procedure effectively
reproduced the target mean vector (p<.01). When the sample size increased to 2000, only
listwise deletion was effective. In addition, the variance in mean vectors accounted for by group
differences had increased to 5% in some instances as presented in Table 2.

² Because sample size varies by variable when pairwise deletion is used, the pairwise
deletion n was set to the n of listwise deletion for all calculations.



Insert Table 2 About Here 



Discussion and Conclusions 

When 30% of the cases in a sample were incomplete, all missing data methods tested 
adequately reproduced the target sample covariance matrix and mean vector regardless of sample 
size. This would imply that if only a few cases were incomplete in a sample, the choice of method 
used to handle missing data could be made based upon considerations of loss of data (in the 
deletion methods) or other substantive reasons. When, however, 50% of the cases were 
incomplete, only listwise and pairwise deletion were effective under all conditions. While this 
could be attributed to reduction in sample size, only 1% of the variance between mean vectors 
could be explained by the listwise deletion method, 1-2% by pairwise deletion, and 2-3% by the 
other methods. This finding suggests that listwise deletion would be the method of choice 
regardless of reduction in sample size. 

Although no method adequately reproduced the target sample covariance matrix when 
70% of the cases were incomplete as measured by χ², the goodness of fit index was adequate for
all methods. The root mean square residual results indicated an adequate fit for the listwise 
deletion and regression methods and a tolerable fit for pairwise deletion and the EM algorithm. 
Listwise deletion, however, consistently reproduced the mean vector across all conditions. Thus,
this finding would also suggest that listwise deletion would be the method of choice. 

This study was limited to one sample for each combination of sample size and proportion of incomplete cases
test. Consequently, results may be specific to these samples. In addition, it was assumed the 
replacement incomplete cases were similar to the complete cases they replaced. If this assumption 
was not valid, these results may change with the next sample. These limitations, however, did not 
influence the pattern of missing values. In all instances the missing data were not missing 
completely at random. Because there is no specific test for missing at random (Hill, 1997), no 
conclusion concerning it can be made. However, examination of the data provided suggests that 
this assumption is also violated. 

The most prevalent missingness pattern existed in the concurrently missing values for 
standardized tests and grades. This pattern may explain why listwise deletion fared better than 
other methods. If the most highly related variables (standardized test scores) contain concurrently 
missing values, any method relying on other variables to estimate a variable suffers. If, in addition, 
these concurrently missing values are also missing simultaneously with another variable (grades) 
that should be related, the situation becomes even worse. Thus, an assumption underlying each
missing data method tested was violated in each sample.

The most surprising result of this study was the relatively effective performance of each 
missing data method when considering the violation of the missing completely at random and 
missing at random assumptions. The failure to satisfy the randomness assumption, however, is the 
primary finding of importance in this study. This finding suggests that other samples selected from 
the NELS database would also contain non-randomly missing values. In light of this finding, it
is suggested that future missing data research focus on methods to overcome the
randomness limitation. Researchers in all areas are cautioned to examine the data prior to any 
analysis. Before making any decisions concerning method of handling missing data, the pattern of 
missingness must be scrutinized. 

Tables & Figures

Table 1 Comparison of Missing Data Method Covariance Matrix to Target Matrix

Table 2 Contrast of Missing Data Method Mean Vector with Target Mean Vector

Figure 1 Patterns of Missing Values

Test of Missing Data Method Covariance Matrix to Target Matrix